Prediction with tidymodels

Author

Nick Duran

Published

November 16, 2023

PART 1: What are tidymodels?

A modeling framework!

```{r load-tm}
#| message: true
library(tidymodels)
```
#> ── Attaching packages ──────────────────────────── tidymodels 1.1.1 ──
#> ✔ broom        1.0.5     ✔ rsample      1.2.0
#> ✔ dials        1.2.0     ✔ tibble       3.2.1
#> ✔ dplyr        1.1.3     ✔ tidyr        1.3.0
#> ✔ infer        1.0.4     ✔ tune         1.1.2
#> ✔ modeldata    1.2.0     ✔ workflows    1.1.3
#> ✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
#> ✔ purrr        1.0.2     ✔ yardstick    1.2.0
#> ✔ recipes      1.0.8
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Use suppressPackageStartupMessages() to eliminate package startup messages

What we will be working on today

  • Minimal version of predictive modeling process

What we will not be working on today but that is in the recorded lectures

  • Feature engineering (recipes)

Machine learning as a predictive modeling process

Your turn

They are sometimes thought of as capturing “two cultures”: model first vs. data first, or inference vs. prediction

The whole game

Splitting data (aka your data budget) and building a decision tree

Cross validation, i.e., resampling

  • Comparing multiple model types and evaluating to identify best one

Creating a final model fit object and doing a final test

Data

Taxi trips in Chicago in 2022

The city of Chicago releases anonymized trip-level data on taxi trips in the city. We will use a dataset that pulls a sample of 10,000 rides occurring in early 2022.

Tip

Type ?taxi to learn more about this dataset, including references.

```{r}
library(tidymodels)
```
```{r}
glimpse(taxi)
```
#> Rows: 10,000
#> Columns: 7
#> $ tip      <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y…
#> $ distance <dbl> 17.19, 0.88, 18.11, 20.70, 12.23, 0.94, 17.47, 17.67, 1.85, 1…
#> $ company  <fct> Chicago Independents, City Service, other, Chicago Independen…
#> $ local    <fct> no, yes, no, no, no, yes, no, no, no, no, no, no, no, yes, no…
#> $ dow      <fct> Thu, Thu, Mon, Mon, Sun, Sat, Fri, Sun, Fri, Tue, Tue, Sun, W…
#> $ month    <fct> Feb, Mar, Feb, Apr, Mar, Apr, Mar, Jan, Apr, Mar, Mar, Apr, A…
#> $ hour     <int> 16, 8, 18, 8, 21, 23, 12, 6, 12, 14, 18, 11, 12, 19, 17, 13, …
Note

tip: Whether the rider left a tip. A factor with levels “yes” and “no”.

distance: The trip distance, in odometer miles.

company: The taxi company, as a factor. Companies that occurred few times were binned as “other”.

local: Whether the trip started in the same community area as it ended. See the source data for community area values.

dow: The day of the week in which the trip began, as a factor.

month: The month in which the trip began, as a factor.

hour: The hour of the day in which the trip began, as a numeric.

Checklist for predictors

  • Is it ethical to use this variable? (Or even legal?)
  • Will this variable be available at prediction time?
  • Does this variable contribute to explainability?

PART 2: Data Budget

Data splitting and spending

Always have a separate piece of data that can contradict what you believe!

For machine learning, we typically split data into training and test sets:

  • The training set is used to estimate model parameters.
  • The test set is used to find an independent assessment of model performance. Do not 🚫 use the test set during training. The test set is precious 💎

  • Spending too much data in training prevents us from computing a good assessment of predictive performance.

  • Spending too much data in testing prevents us from computing a good estimate of model parameters.

Your turn

  • When is a good time to split your data?

The initial split

```{r taxi-split}
set.seed(123)
taxi_split <- initial_split(taxi)
taxi_split
```
#> <Training/Testing/Total>
#> <7500/2500/10000>
How much data in training vs testing?

This function uses a good default, but this depends on your specific goal/data. There are more powerful ways of splitting, like stratification, that we will get into later.

What is set.seed()?

To create that split of the data, R generates “pseudo-random” numbers: while they are made to behave like random numbers, their generation is deterministic given a “seed”.

This allows us to reproduce results by setting that seed.

Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.
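A quick base-R sketch of what this buys us: two runs with the same seed produce identical draws, so anyone re-running your split gets the same training and test sets.

```r
# Setting the same seed before each draw makes the "random" results identical
set.seed(123)
first_draw <- sample(1:10000, 5)

set.seed(123)
second_draw <- sample(1:10000, 5)

identical(first_draw, second_draw)  # TRUE: the draw is reproducible
```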

Accessing the data

```{r}
#| label: taxi-train-test
taxi_train <- training(taxi_split)
taxi_test <- testing(taxi_split)
```

The training set

```{r}
#| label: taxi-train
taxi_train
```
#> # A tibble: 7,500 × 7
#>    tip   distance company                   local dow   month  hour
#>    <fct>    <dbl> <fct>                     <fct> <fct> <fct> <int>
#>  1 yes       0.7  Taxi Affiliation Services yes   Tue   Mar      18
#>  2 yes       0.99 Sun Taxi                  yes   Tue   Jan       8
#>  3 yes       1.78 other                     no    Sat   Mar      22
#>  4 yes       0    Taxi Affiliation Services yes   Wed   Apr      15
#>  5 yes       0    Taxi Affiliation Services no    Sun   Jan      21
#>  6 yes       2.3  other                     no    Sat   Apr      21
#>  7 yes       6.35 Sun Taxi                  no    Wed   Mar      16
#>  8 yes       2.79 other                     no    Sun   Feb      14
#>  9 yes      16.6  other                     no    Sun   Apr      18
#> 10 yes       0.02 Chicago Independents      yes   Sun   Apr      15
#> # ℹ 7,490 more rows

The test set

🙈

There are 2500 rows and 7 columns in the test set.

Your turn

  • Split your data so 20% is held out for the test set and show the first 10 rows or so of the training data

  • Try out different values in set.seed() to see how the results change.

Exploratory data analysis

As you know by now, before we get too far into our data analysis, we want to do some exploration with visualizations. So let’s get to it.

Your turn

  • What’s the distribution of the outcome, tip?

  • What’s the distribution of numeric variables like distance?

  • How does tip differ across the categorical variables?

For taxi-tip-counts, what about using geom_bar() with just tip?

For taxi-tip-by-distance, I would recommend geom_histogram() and playing around with the bin size. You will also want to incorporate facet_grid().

For taxi-tip-by-hour, I would plot making use of fill = and geom_bar()

For taxi-tip-by-local, I’ll leave this one up to you, but add a color palette that is friendly to color-blind viewers.

Split smarter

Based on our EDA, we know that the source data contains fewer "no" tip values than "yes". We want to make sure we allot equal proportions of those responses so that both the training and testing data have enough of each to give accurate estimates.

Stratified sampling would split within response values

Tip
  • For stratification involving regression, determine the quartiles of the data set and sample within those artificial groups
  • For time series, we often use the most recent data as the test set
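To see what stratification buys, here is a small base-R sketch (toy data, not the taxi set): sampling within each response value keeps the class proportions of both partitions close to the full data. In rsample, the `strata` argument to `initial_split()` does this bookkeeping for you.

```r
# Toy imbalanced outcome: 92% "yes", 8% "no" (roughly like taxi tips)
outcome <- factor(rep(c("yes", "no"), times = c(920, 80)))

# Stratified split: sample 75% *within* each class
set.seed(123)
train_idx <- unlist(lapply(split(seq_along(outcome), outcome), function(idx) {
  sample(idx, size = floor(0.75 * length(idx)))
}))

# Both partitions preserve the ~92% "yes" rate of the full data
train_prop <- mean(outcome[train_idx] == "yes")
test_prop  <- mean(outcome[-train_idx] == "yes")
round(c(train = train_prop, test = test_prop), 2)  # both 0.92
```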

Your turn

  • Go back and modify your earlier code where you split the data to stratify by tip

PART 3: Fitting a model

Do it the tidymodels way

  • Choose a model
  • Specify an engine
  • Set the mode

Choose a model

The model type differentiates basic modeling approaches, such as linear regression, logistic regression, linear support vector machines, etc.

What models are available to you?

All available models are listed at https://www.tidymodels.org/find/parsnip/

How about we start with a good ol’ linear regression model?

```{r linear-reg}
linear_reg()
```
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm

Specify an engine

Question: How many ways are there to fit a linear model in R?

Lots of ways!

  • lm for linear model (default)

  • glm for generalized linear model

  • glmnet for regularized regression

  • keras for regression using TensorFlow

  • stan for Bayesian regression

  • spark for large data sets

The computational engine indicates how the model is fit, such as with a specific R package implementation or even methods outside of R like Keras or Stan.

Note that some models come with a default engine when you call them, but this can easily be overridden depending on your needs.
```{r linear-reg-glm}
linear_reg() %>% 
  set_engine("glm", family = stats::poisson(link = "log"))
```
#> Linear Regression Model Specification (regression)
#> 
#> Engine-Specific Arguments:
#>   family = stats::poisson(link = "log")
#> 
#> Computational engine: glm
```{r linear-reg-stan}
linear_reg() %>%
  set_engine("stan")
```
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: stan

Set the mode

The mode denotes in what kind of modeling context it will be used (most commonly, classification or regression)

Note

Note that some models have a default mode, some do not

```{r decision-tree}
decision_tree()
```
#> Decision Tree Model Specification (unknown mode)
#> 
#> Computational engine: rpart
```{r decision-tree-2}
decision_tree() %>% 
  set_mode("classification")
```
#> Decision Tree Model Specification (classification)
#> 
#> Computational engine: rpart

Your turn

  • Copy the chunk below and edit the code to create a logistic regression model

```{r}
#| eval: false

# Model
linear_reg()

# Engine
linear_reg() %>%
  set_engine("glmnet")

# Mode (some models have a default mode, others don't)
decision_tree() %>% 
  set_mode("regression")
```

Models we’ll be using today

  • Logistic regression (you)
  • Decision trees (me)

Logistic regression

The concept of odds and probability:

  • The probability of some event happening can be expressed as a value between 0 and 1
  • If the probability of an event occurring is \({p}\), the odds of an event occurring can be expressed as \(\frac{p}{1 - p}\)

“If there is a 75% chance (or 0.75 probability) that it will rain today, then… the odds of it raining are 3 to 1, meaning that it is 3 times more likely to rain than not.”

From odds to log-odds:

  • It’s challenging to work with odds in mathematical modeling, especially when there are multiple factors that might affect an outcome
  • This is why we take the logarithm of the odds which puts it on a scale that is easier to work with
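These conversions are one-liners in base R; the numbers below replicate the rain example above.

```r
p <- 0.75              # probability it rains
odds <- p / (1 - p)    # 3: rain is 3 times more likely than not
log_odds <- log(odds)  # ~1.099, the logit scale the model works on
c(odds = odds, log_odds = round(log_odds, 3))
```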

The Logistic Regression Equation:

\(log(\frac{p}{1 - p}) = \beta_0 + \beta_1\cdot \text{A}\)

  • \(log(\frac{p}{1 - p})\) is the log-odds (or the logit) of the probability of our event of interest (e.g., getting a tip)
  • \(\beta_0\) is the intercept, which represents the log-odds of the event when predictor A is zero
  • \(\beta_1\cdot \text{A}\) is the slope, or coefficient, for predictor A, indicating how much the log-odds of the event change with a one-unit change in A
Note

It is common to convert the log-odds units into what is called an odds ratio (OR)

Relating this back to taxis and tips

  • We want to predict whether someone will tip (“one”) or not based on the trip distance, in odometer miles (A).
  • \(\beta_0\) would represent the log-odds of receiving a tip at zero distance
  • \(\beta_1\) shows how the log-odds of receiving a tip change with each additional mile of distance
  • Maybe as distance increases, the probability of getting a tip increases? (but not linearly; there are diminishing returns after a certain point)
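A minimal sketch of this relationship, using made-up coefficients (b0 and b1 below are illustrative, not fitted values): plogis() is base R's inverse logit, which converts log-odds back to probabilities.

```r
b0 <- -0.5   # hypothetical intercept: log-odds of a tip at zero distance
b1 <- 0.15   # hypothetical slope: change in log-odds per additional mile

distance <- c(0, 5, 10, 20)
log_odds <- b0 + b1 * distance
prob <- plogis(log_odds)   # inverse logit: log-odds -> probability

# Probability rises with distance, but along a sigmoid, not a straight line
round(prob, 2)  # 0.38 0.56 0.73 0.92
```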

A possibly helpful visual

  • Lets simulate a dataset of 500 observations with a single predictor A using our logistic regression equation. We ensure that the log-odds of the binary outcome (class with levels “zero” and “one”) increases with respect to A with an intercept of .1 and a slope of 2.
  • Estimate the probability (converting log-odds to probabilities) of the “one” class for different values of A and plot these (A values are binned in 0.5 increments; error bars show the confidence intervals for estimates)
  • The visual demonstrates the concept of the odds and probabilities changing in a non-linear fashion across the range of A, a key concept in logistic regression.

What a logistic regression is doing is trying to fit this sort of sigmoid line to separate the two classes in an outcome.

Decision trees

An analogy

  • Let’s play 20 questions: I’m thinking of a person, place, or thing. Guess it by asking “yes” or “no” questions

A slightly more formal characterization

  • A decision tree is a flowchart-like tree structure consisting of splits or if/then statements based on predictors

But to explain, visualizations are best for decision trees

0 corresponds to “Did Not Survive”; 1 corresponds to “Survived”

Pclass: Passenger Class. This is a categorical feature indicating the class of travel of the passenger aboard the Titanic. It typically has three categories: 1 for first class (the highest), 2 for second class, and 3 for third class (the lowest).

Sex: The sex of the passenger (male or female). This is a binary categorical feature.

Age: The age of the passenger. This is a continuous numerical feature.

SibSp: The number of siblings or spouses the passenger had aboard the Titanic. This is a discrete numerical feature.

Parch: The number of parents or children the passenger had aboard the Titanic. This is also a discrete numerical feature. Note that some relations, like nannies or friends, may not be included in this count.

Fare: The amount of money the passenger paid for the ticket. This is a continuous numerical feature.

A few things to keep in mind

  • A tree chooses a feature and a split point (e.g., if Age >= 6.5) to partition the data into subsets that are as homogeneous (“pure”) as possible (meaning that the node contains data points from predominantly one class)
    • Just know there are measures for evaluating the “purity” of a node, e.g., Gini Impurity, Entropy, Classification Error
  • This process is repeated recursively, forming a tree structure until certain stopping criteria are met (e.g., when a node has a small number of observations, some maximum depth).
  • Trees are also pruned to reduce their complexity
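As a concrete example of a purity measure, here is Gini impurity in a few lines of base R (a sketch of the idea, not rpart's implementation):

```r
# Gini impurity of a node: 1 minus the sum over classes of p_k^2
# 0 means the node is pure; 0.5 is maximally mixed for two classes
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini(c("yes", "yes", "yes", "yes"))  # 0: a pure node
gini(c("yes", "yes", "no", "no"))    # 0.5: an even 50/50 split
```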

Another possibly helpful visual

Model workflows

  • In tidymodels, there is this idea that a model-oriented data analysis consists of a preprocessor and a model, and these should be built into a single “workflow” object
    • Preprocessors range from simple formulas y ~ x1 + x2 to sophisticated recipes that allow feature engineering operations: e.g., extracting features using PCA, removing zero-variance variables, recoding missing categorical values (NAs) as “unknown”, dummy-coding categorical variables, and lots more
    • Models allow different ways to fit the data
  • By packaging preprocessing steps and modeling in one workflow, lots of advantages!
    • Ensures that new data goes through the same preprocessing steps as the training data
    • Ensures consistency and reproducibility
    • Helps organize your work when working with multiple models
    • Handles new data (e.g., new factor levels) better than base R tools

A model workflow

```{r tree-wflow}
#logi_spec <-
tree_spec <-                            # set up all the specs on model
  decision_tree(cost_complexity = 0.002) %>%   
  set_mode("classification")

#logi_wflow <-
tree_wflow <- workflow() %>%            # build the workflow
  add_formula(tip ~ .) %>%              # preprocessor steps (could be lots!)
  add_model(tree_spec)                  # add model

#logi_fit <-                            
tree_fit <-                             # now fit to training data 
  tree_wflow %>% 
  fit(data = taxi_train) 
```
Tip

“Shortcut” by specifying the preprocessor and model specs directly in the workflow() call:

```{r}
#| eval: false
workflow(tip ~ ., tree_spec) %>% 
  fit(data = taxi_train) 
```

Your turn

  • Copy the chunk above “{r tree-wflow}” and edit the code to create a workflow for a logistic regression model

    • Make sure you swap out tree_* for logi_* when saving each set of code

Pre-processing: Adding “recipes”

  • Hypothetical additions just for illustrative purposes
```{r}
#| eval: false
tree_rec <- recipe(tip ~ ., data = taxi_train) %>%  # a recipe starts from a formula + data
  step_date(some_var1, features = c("dow", "month", "year")) %>% 
  step_rm(some_var1) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors())

#logi_wflow <-
tree_wflow <- workflow() %>%            # build the workflow
  add_recipe(tree_rec) %>%              # preprocessor: the recipe carries the formula, so no add_formula()
  add_model(tree_spec)                  # add model
```

How do you understand your new tree_fit model?

Predict with your model

```{r}
predict(tree_fit, new_data = taxi_test)
```
#> # A tibble: 2,000 × 1
#>    .pred_class
#>    <fct>      
#>  1 yes        
#>  2 yes        
#>  3 yes        
#>  4 yes        
#>  5 yes        
#>  6 yes        
#>  7 yes        
#>  8 yes        
#>  9 yes        
#> 10 yes        
#> # ℹ 1,990 more rows
```{r}
augment(tree_fit, new_data = taxi_test)
```
#> # A tibble: 2,000 × 10
#>    tip   distance company local dow   month  hour .pred_class .pred_yes .pred_no
#>    <fct>    <dbl> <fct>   <fct> <fct> <fct> <int> <fct>           <dbl>    <dbl>
#>  1 yes      20.7  Chicag… no    Mon   Apr       8 yes             0.967   0.0333
#>  2 yes       1.47 City S… no    Tue   Mar      14 yes             0.917   0.0830
#>  3 yes       1    Taxi A… no    Mon   Feb      18 yes             0.917   0.0830
#>  4 yes       1.91 Flash … no    Wed   Apr      15 yes             0.917   0.0830
#>  5 yes      17.2  City S… no    Mon   Apr       9 yes             0.967   0.0333
#>  6 yes      17.8  City S… no    Mon   Mar       9 yes             0.967   0.0333
#>  7 yes       0.53 Taxica… yes   Wed   Apr       8 yes             0.917   0.0830
#>  8 yes       1.77 other   no    Thu   Apr      15 yes             0.917   0.0830
#>  9 yes      18.6  Flash … no    Thu   Apr      12 yes             0.967   0.0333
#> 10 no        1.13 other   no    Sat   Feb      14 yes             0.917   0.0830
#> # ℹ 1,990 more rows
The tidymodels prediction guarantee!
  • The predictions will always be inside a tibble
  • The column names and types are unsurprising and predictable
  • The number of rows in new_data and the output are the same

Bonus: For decision trees, visualization

Additional methods for understanding models
  • Overall variable importance, such as with the vip package

  • Flexible model explainers, such as with the DALEXtra package. These work by removing (or permuting) a variable and seeing the effect on prediction; if the variable is important, this will substantially degrade the fit’s accuracy

Learn more at https://www.tmwr.org/explain.html

Your turn

  • For the fitted logistic regression model, use augment, then glance and tidy. What happens with each?

PART 4: Metrics for model performance

Warning

In the following we will be working on evaluation metrics based on the training set data, but keep in mind that you also need to run the metrics on the test data. If you fail to do so, it can give a misleading impression of model performance. Hopefully this will be clearer soon.

```{r taxi-fit-augment}
augment(tree_fit, new_data = taxi_train) %>%
  relocate(tip, .pred_class, .pred_yes, .pred_no)
```
#> # A tibble: 8,000 × 10
#>    tip   .pred_class .pred_yes .pred_no distance company local dow   month  hour
#>    <fct> <fct>           <dbl>    <dbl>    <dbl> <fct>   <fct> <fct> <fct> <int>
#>  1 yes   yes             0.967   0.0333    17.2  Chicag… no    Thu   Feb      16
#>  2 yes   yes             0.917   0.0830     0.88 City S… yes   Thu   Mar       8
#>  3 yes   yes             0.967   0.0333    18.1  other   no    Mon   Feb      18
#>  4 yes   yes             0.858   0.142     12.2  Chicag… no    Sun   Mar      21
#>  5 yes   yes             0.917   0.0830     0.94 Sun Ta… yes   Sat   Apr      23
#>  6 yes   yes             0.967   0.0333    17.5  Flash … no    Fri   Mar      12
#>  7 yes   yes             0.967   0.0333    17.7  other   no    Sun   Jan       6
#>  8 yes   yes             0.917   0.0830     1.85 Taxica… no    Fri   Apr      12
#>  9 yes   yes             0.917   0.0830     0.53 Sun Ta… no    Tue   Mar      18
#> 10 yes   yes             0.858   0.142      6.65 Taxica… no    Sun   Apr      11
#> # ℹ 7,990 more rows

Confusion matrix

conf_mat() can be used to see how well the model is doing at prediction

```{r conf-mat-plot}
augment(tree_fit, new_data = taxi_train) %>%
  conf_mat(truth = tip, estimate = .pred_class) %>%
  autoplot(type = "heatmap")
```

Accuracy

This is a straightforward measure of the proportion of correct predictions out of all predictions.

```{r acc-2}
augment(tree_fit, new_data = taxi_train) %>%
  # overwrite the model's predictions with an all-"yes" prediction
  mutate(.pred_class = factor("yes", levels = c("yes", "no"))) %>%
  accuracy(truth = tip, estimate = .pred_class)
```
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.923
Dangers of accuracy

We need to be careful of using accuracy() since it can give “good” performance by only predicting one way with imbalanced data
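A base-R sketch of that failure mode (toy numbers mirroring the roughly 92% “yes” rate): a “model” that always predicts the majority class scores high accuracy while being useless on the minority class.

```r
truth      <- factor(rep(c("yes", "no"), times = c(92, 8)))
always_yes <- factor(rep("yes", 100), levels = levels(truth))

accuracy    <- mean(always_yes == truth)                  # 0.92: looks good
sensitivity <- mean(always_yes[truth == "yes"] == "yes")  # 1: catches every "yes"
specificity <- mean(always_yes[truth == "no"] == "no")    # 0: never catches a "no"
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```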

Sensitivity

Where accuracy provides a quick understanding of how often the model is correct, sensitivity is an important measure when the cost of missing a positive case (false negative) is high. In our scenario, a positive case is the outcome variable coded as “1” (tip: yes). In other scenarios, this is more consequential (such as correctly predicting the percentage of sick people who have some targeted condition).

```{r sens}
augment(tree_fit, new_data = taxi_train) %>%
  sensitivity(truth = tip, estimate = .pred_class)
```
#> # A tibble: 1 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 sensitivity binary         0.998

Specificity

This is essentially the complement of sensitivity. Specificity is an important measure of how well the model correctly predicts the true negatives. In our scenario, a true negative is the outcome variable coded as “0” (tip: no). Pay attention to specificity when the cost of a false positive is high (like convicting an innocent person).

```{r spec}
augment(tree_fit, new_data = taxi_train) %>%
  specificity(truth = tip, estimate = .pred_class)
```
#> # A tibble: 1 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 specificity binary        0.0390

Combining the metrics

We can use metric_set() to combine multiple calculations into a single table

```{r taxi-metrics}
taxi_metrics <- metric_set(accuracy, specificity, sensitivity)

augment(tree_fit, new_data = taxi_train) %>%
  taxi_metrics(truth = tip, estimate = .pred_class)
```
#> # A tibble: 3 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 accuracy    binary        0.924 
#> 2 specificity binary        0.0390
#> 3 sensitivity binary        0.998

All yardstick metric functions work with grouped data frames!

```{r taxi-metrics-grouped}
taxi_metrics <- metric_set(accuracy, specificity, sensitivity)

augment(tree_fit, new_data = taxi_train) %>%
  group_by(local) %>%
  taxi_metrics(truth = tip, estimate = .pred_class)
```
#> # A tibble: 6 × 4
#>   local .metric     .estimator .estimate
#>   <fct> <chr>       <chr>          <dbl>
#> 1 yes   accuracy    binary        0.893 
#> 2 no    accuracy    binary        0.932 
#> 3 yes   specificity binary        0.0181
#> 4 no    specificity binary        0.0467
#> 5 yes   sensitivity binary        1     
#> 6 no    sensitivity binary        0.998

Your turn

  • Compute a confusion matrix for your logistic model and produce a table that shows its accuracy, specificity, sensitivity scores

An important aside: Two class data

These metrics assume that we know the threshold for converting “soft” probability predictions into “hard” class predictions.

Is a 50% threshold good?

What happens if we say that we need to be 80% sure to declare an event?

  • sensitivity ⬇️, specificity ⬆️ In other words: more likely to predict negatives (no tips)

What happens for a 20% threshold?

  • sensitivity ⬆️, specificity ⬇️ In other words: more likely to predict positives (yes tips)
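A small base-R sketch of this trade-off (the predicted probabilities below are made up for illustration):

```r
pred_yes <- c(0.95, 0.85, 0.70, 0.55, 0.30, 0.10)  # hypothetical P(tip = yes)
truth    <- c("yes", "yes", "no", "yes", "no", "no")

classify <- function(p, threshold) ifelse(p >= threshold, "yes", "no")
sens <- function(pred) mean(pred[truth == "yes"] == "yes")
spec <- function(pred) mean(pred[truth == "no"] == "no")

# Raising the threshold from 0.5 to 0.8 lowers sensitivity, raises specificity
c(sens_50 = sens(classify(pred_yes, 0.5)), spec_50 = spec(classify(pred_yes, 0.5)),
  sens_80 = sens(classify(pred_yes, 0.8)), spec_80 = spec(classify(pred_yes, 0.8)))
```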

ROC curves

An ROC (receiver operating characteristic) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system.

It primarily:

  • Calculates the sensitivity and specificity for all possible thresholds. It helps you (in a visual way) understand whether a model is optimally balancing sensitivity and specificity in a way that makes sense

It is created by:

  • Plotting the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, 1 - specificity) at various threshold settings
  • In other words: plotting the rate at which the model correctly predicts “tip=yes” against the rate at which it mistakenly predicts “tip=yes”, at various cut-points above which a prediction is considered to be of the positive class.
Why 1 - specificity for FPR?

Sensitivity is the true positive rate and specificity is the true negative rate; hence 1 - specificity is the false positive rate.

Here’s the ROC curve for our model based on the training data:

```{r roc-curve}
#| fig-width: 6
#| fig-height: 6
#| output-location: "column"

augment(tree_fit, new_data = taxi_train) %>% 
  roc_curve(truth = tip, .pred_yes) %>%
  autoplot() +
  labs(x="FPR, 1-specificity", y="TPR, or sensitivity")
```

How do we interpret:

  • A high Area Under the Curve (AUC) indicates that for the majority of thresholds, the TPR (sensitivity) was high and the FPR was low. It means that the model has a good measure of separability and is capable of distinguishing between the positive and negative classes across a wide range of thresholds.

  • ROC AUC = 1 💯

  • ROC AUC = 1/2 😢

Let’s get the actual value for AUC:

```{r roc-auc}
augment(tree_fit, new_data = taxi_train) %>% 
  roc_auc(truth = tip, .pred_yes)
```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.611
Note

ROC curves are insensitive to class imbalance.

Note

When you look at a ROC curve, each point corresponds to a threshold, but the curve itself does not tell you what these thresholds are. To find out, you would need to refer back to the data used to create the ROC curve.

Your turn

  • Compute and plot an ROC curve for your logistic model

PART 5: Using resampling to estimate performance

⚠️ Dangers of Overfitting ⚠️

Models can easily be overfitted to the training data, meaning they perform well on the training data but poorly on unseen data.

Fitting to your training data

Problems when generalizing to test data

Is our model doing well on training but poorly on new data?

Here’s our model performance on the training data

```{r augment-train}
tree_fit %>%
  augment(taxi_train)
```
#> # A tibble: 8,000 × 10
#>    tip   distance company local dow   month  hour .pred_class .pred_yes .pred_no
#>    <fct>    <dbl> <fct>   <fct> <fct> <fct> <int> <fct>           <dbl>    <dbl>
#>  1 yes      17.2  Chicag… no    Thu   Feb      16 yes             0.967   0.0333
#>  2 yes       0.88 City S… yes   Thu   Mar       8 yes             0.917   0.0830
#>  3 yes      18.1  other   no    Mon   Feb      18 yes             0.967   0.0333
#>  4 yes      12.2  Chicag… no    Sun   Mar      21 yes             0.858   0.142 
#>  5 yes       0.94 Sun Ta… yes   Sat   Apr      23 yes             0.917   0.0830
#>  6 yes      17.5  Flash … no    Fri   Mar      12 yes             0.967   0.0333
#>  7 yes      17.7  other   no    Sun   Jan       6 yes             0.967   0.0333
#>  8 yes       1.85 Taxica… no    Fri   Apr      12 yes             0.917   0.0830
#>  9 yes       0.53 Sun Ta… no    Tue   Mar      18 yes             0.917   0.0830
#> 10 yes       6.65 Taxica… no    Sun   Apr      11 yes             0.858   0.142 
#> # ℹ 7,990 more rows
```{r augment-acc}
tree_fit %>%
  augment(taxi_train) %>%
  accuracy(tip, .pred_class)
```
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.924

Here’s our model performance on the test data (unseen data)

```{r augment-acc-test}
tree_fit %>%
  augment(taxi_test) %>%
  accuracy(tip, .pred_class)
```
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.914

Better to fit on training data via a process of resampling

Resampling is basically an empirical simulation system used to understand how well the model would work on new data. It gives a more robust estimate of the model’s performance

V-fold (K-fold) Cross-validation

  • We want to repeatedly sample (“fold”) the training data to create unique subsets (e.g., Resample 1, Resample 2, Resample B) for analysis and assessment

  • Here, we randomly split the training data into V (or K) distinct blocks of roughly equal size (AKA the “folds”)

  • In Fold 1:
    • We leave out the first block of analysis data and fit a model
    • This model is used to predict the held-out block of assessment data
  • We continue this process until we’ve predicted all V assessment blocks
  • This process ensures that every observation from the original dataset has the chance of appearing in the analysis and assessment set
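The fold assignment itself is simple; here is a base-R sketch of the bookkeeping that `vfold_cv()` handles for you:

```r
set.seed(123)
n <- 100                                      # pretend training-set size
fold_id <- sample(rep(1:10, length.out = n))  # shuffle rows into 10 folds

# For fold 1: its rows are the assessment set, everything else is analysis
assessment <- which(fold_id == 1)
analysis   <- which(fold_id != 1)
c(analysis = length(analysis), assessment = length(assessment))  # 90 and 10
```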

Other advantages of resampling with V-fold cross-validation
  • Efficient use of data: When the amount of data is limited, cross-validation is a way to use every data point in both training and validation

  • Model tuning: Useful method for selecting the model with the best tuning parameters (hyperparameters). You can perform cross-validation for various combinations of parameters and choose the one that performs best.

  • Comparing models: What if we want to compare more models to see if there are important differences? Cross-validation allows us to do just this with just the training data

Your turn

  • If we use 10 folds, what percent of the training data ends up in ANALYSIS for each fold? for ASSESSMENT for each fold?

Let’s do some cross-validation

```{r vfold-cv}
vfold_cv(taxi_train) # v = 10 is default
```
#> #  10-fold cross-validation 
#> # A tibble: 10 × 2
#>    splits             id    
#>    <list>             <chr> 
#>  1 <split [7200/800]> Fold01
#>  2 <split [7200/800]> Fold02
#>  3 <split [7200/800]> Fold03
#>  4 <split [7200/800]> Fold04
#>  5 <split [7200/800]> Fold05
#>  6 <split [7200/800]> Fold06
#>  7 <split [7200/800]> Fold07
#>  8 <split [7200/800]> Fold08
#>  9 <split [7200/800]> Fold09
#> 10 <split [7200/800]> Fold10

What is in this?

```{r taxi-splits}
taxi_folds <- vfold_cv(taxi_train)
taxi_folds$splits[1:3]
```
#> [[1]]
#> <Analysis/Assess/Total>
#> <7200/800/8000>
#> 
#> [[2]]
#> <Analysis/Assess/Total>
#> <7200/800/8000>
#> 
#> [[3]]
#> <Analysis/Assess/Total>
#> <7200/800/8000>
Note

These separate sets are stored in a list column, which is a way to store non-atomic objects in a data frame

We can do this with however many folds we want

```{r vfold-cv-v}
vfold_cv(taxi_train, v = 5)
```
#> #  5-fold cross-validation 
#> # A tibble: 5 × 2
#>   splits              id   
#>   <list>              <chr>
#> 1 <split [6400/1600]> Fold1
#> 2 <split [6400/1600]> Fold2
#> 3 <split [6400/1600]> Fold3
#> 4 <split [6400/1600]> Fold4
#> 5 <split [6400/1600]> Fold5

Stratification often helps, with very little downside

```{r vfold-cv-strata}
vfold_cv(taxi_train, strata = tip)
```
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 × 2
#>    splits             id    
#>    <list>             <chr> 
#>  1 <split [7200/800]> Fold01
#>  2 <split [7200/800]> Fold02
#>  3 <split [7200/800]> Fold03
#>  4 <split [7200/800]> Fold04
#>  5 <split [7200/800]> Fold05
#>  6 <split [7200/800]> Fold06
#>  7 <split [7200/800]> Fold07
#>  8 <split [7200/800]> Fold08
#>  9 <split [7200/800]> Fold09
#> 10 <split [7200/800]> Fold10

We’ll use this setup:

Important

Set the seed when creating resamples

```{r taxi-folds}
set.seed(123)
taxi_folds <- vfold_cv(taxi_train, v = 10, strata = tip)
taxi_folds
```
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 × 2
#>    splits             id    
#>    <list>             <chr> 
#>  1 <split [7200/800]> Fold01
#>  2 <split [7200/800]> Fold02
#>  3 <split [7200/800]> Fold03
#>  4 <split [7200/800]> Fold04
#>  5 <split [7200/800]> Fold05
#>  6 <split [7200/800]> Fold06
#>  7 <split [7200/800]> Fold07
#>  8 <split [7200/800]> Fold08
#>  9 <split [7200/800]> Fold09
#> 10 <split [7200/800]> Fold10

Fit our model to the resamples!

We will fit 10 models on 10 slightly different analysis sets.

```{r fit-resamples}
taxi_res <- fit_resamples(tree_wflow, taxi_folds)
taxi_res
```
#> # Resampling results
#> # 10-fold cross-validation using stratification 
#> # A tibble: 10 × 4
#>    splits             id     .metrics         .notes          
#>    <list>             <chr>  <list>           <list>          
#>  1 <split [7200/800]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
#>  2 <split [7200/800]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
#>  3 <split [7200/800]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
#>  4 <split [7200/800]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
#>  5 <split [7200/800]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
#>  6 <split [7200/800]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
#>  7 <split [7200/800]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
#>  8 <split [7200/800]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
#>  9 <split [7200/800]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
#> 10 <split [7200/800]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>
Alternate resampling schemes

Bootstrapping

  • Assess model accuracy and stability

  • From your original training dataset, create many (often thousands) of new “bootstrap” training datasets by sampling with replacement.

  • After running the bootstrap process many times, you’ll end up with a distribution of performance metrics (like accuracy or AUC).

Handling Small Datasets: It is particularly useful when dealing with small datasets where you might not be able to afford to set aside a portion of the data as a test set.

Uncertainty Estimates: It provides a way to create confidence intervals for various performance metrics, giving a sense of how uncertain those metrics are.

```{r bootstraps}
set.seed(3214)
bootstraps(taxi_train)
```
#> # Bootstrap sampling 
#> # A tibble: 25 × 2
#>    splits              id         
#>    <list>              <chr>      
#>  1 <split [8000/2902]> Bootstrap01
#>  2 <split [8000/2916]> Bootstrap02
#>  3 <split [8000/3004]> Bootstrap03
#>  4 <split [8000/2979]> Bootstrap04
#>  5 <split [8000/2961]> Bootstrap05
#>  6 <split [8000/2962]> Bootstrap06
#>  7 <split [8000/3026]> Bootstrap07
#>  8 <split [8000/2926]> Bootstrap08
#>  9 <split [8000/2972]> Bootstrap09
#> 10 <split [8000/2972]> Bootstrap10
#> # ℹ 15 more rows
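Once a metric has been collected across many bootstrap resamples, a 95% percentile interval is just the 2.5% and 97.5% quantiles of those estimates. A toy sketch in base R, with simulated values standing in for real resampled metrics:

```r
set.seed(1)
boot_auc <- rnorm(2000, mean = 0.59, sd = 0.02)  # simulated per-bootstrap AUCs

# Percentile confidence interval: quantiles of the bootstrap distribution
quantile(boot_auc, probs = c(0.025, 0.975))
```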

https://rsample.tidymodels.org/reference/index.html

Your turn

  • Now run vfold_cv to get 10 splits and refit your logistic regression workflow (using fit_resamples)

Evaluating model performance

  • The goal of resampling is to produce a single estimate of performance for a model
  • The final performance estimate is computed from the hold-out predictions by averaging the statistics across the V folds.
Note

Even though we end up estimating V models, these models are discarded after we have our performance estimate. 🗑

```{r collect-metrics}
taxi_res %>%
  collect_metrics()
```
#> # A tibble: 2 × 6
#>   .metric  .estimator  mean     n std_err .config             
#>   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 accuracy binary     0.923    10 0.00320 Preprocessor1_Model1
#> 2 roc_auc  binary     0.592    10 0.0126  Preprocessor1_Model1

We can reliably measure performance using only the training data 🎉

Note

collect_metrics() is one of a suite of collect_*() functions that can be used to work with columns of tuning results. Most columns in a tuning result prefixed with . have a corresponding collect_*() function with options for common summaries.
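For example, passing `summarize = FALSE` to `collect_metrics()` returns the per-fold values behind the averages (a sketch, assuming the `taxi_res` object from above):

```r
# One row per fold per metric, instead of the mean/std_err summary
taxi_res %>%
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "roc_auc")
```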

Your turn

  • Use collect_metrics() to get a single estimate of performance on your cross-validated logistic regression

  • Generate a new confusion matrix for your cross-validated logistic regression
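One possible approach for the confusion matrix: hold-out predictions are not saved by default, so refit with `save_pred = TRUE` and pool them with `collect_predictions()`. Here `lr_wflow` is a placeholder name for your logistic regression workflow:

```r
ctrl <- control_resamples(save_pred = TRUE)  # keep hold-out predictions
lr_res <- fit_resamples(lr_wflow, taxi_folds, control = ctrl)

# Pool the predictions from all 10 assessment sets into one confusion matrix
collect_predictions(lr_res) %>%
  conf_mat(truth = tip, estimate = .pred_class)
```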

Does our model performance change when fit once on the training data vs resampling on the training data?

Do you remember this from way back when? This is the ROC AUC from no resampling:

```{r roc-auc-2}
augment(tree_fit, new_data = taxi_train) %>% 
  roc_auc(truth = tip, .pred_yes)
```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.611

Now, this is from resampling:

```{r}
taxi_res %>%
  collect_metrics() %>% 
  select(.metric, mean, n)
```
#> # A tibble: 2 × 3
#>   .metric   mean     n
#>   <chr>    <dbl> <int>
#> 1 accuracy 0.923    10
#> 2 roc_auc  0.592    10

The lesson is that fitting a model once on the full training set (without resampling) gives you overly optimistic metrics ⚠️

What you would report

  • Training Performance (Cross-Validation):
    • Present metrics that summarize the model’s performance across the folds
      • mean and standard deviation of accuracy
      • AUC
      • sensitivity
      • specificity

This gives an indication of how well the model is expected to perform on unseen data, taking into account the variability due to different subsets of the training data.

  • Test Set Performance:
```{r}
taxi_metrics <- metric_set(accuracy, specificity, sensitivity)

augment(tree_fit, new_data = taxi_test) %>%
  taxi_metrics(truth = tip, estimate = .pred_class)
```
#> # A tibble: 3 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 accuracy    binary        0.914 
#> 2 specificity binary        0.0457
#> 3 sensitivity binary        0.997
```{r}
augment(tree_fit, new_data = taxi_test) %>% 
  roc_auc(truth = tip, .pred_yes)
```
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.602

The performance metrics derived from the held-out test set give the reader an understanding of how the model performs on completely unseen data. This is the ultimate test of the model’s generalizability and is often considered the most important set of metrics to report.

  • Overfitting Discussion

Discuss any signs of overfitting observed (for instance, if there was a large discrepancy between the cross-validation performance and the test set performance).

  • Model Interpretability

It’s often important to discuss what the model’s parameters or predictions mean in the context of the subject area.

Your turn

  • Generate a final report for your logistic regression model, but now using the test set

Completing the whole game

PART 6: Next Steps?

Random forest 🌳🌲🌴🌵🌳🌳🌴🌲🌵🌴🌳🌵

  • Ensemble many decision tree models
    • All the trees vote! 🗳️
    • Bootstrap aggregating + random predictor sampling
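A sketch of what such a model could look like in parsnip (the ranger engine is one common choice; `rf_wflow` is a hypothetical name, reusing the `taxi_folds` resamples from above):

```r
rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Swap the new spec into a workflow and resample it just like the decision tree
rf_wflow <- workflow(tip ~ ., rf_spec)
fit_resamples(rf_wflow, taxi_folds) %>% collect_metrics()
```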

Model tuning
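A minimal sketch of tuning, under the same assumed names as above: mark hyperparameters with `tune()`, then search a grid of candidates over the resamples with `tune_grid()`:

```r
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# tune_grid() fits each candidate on each fold; show_best() ranks the results
rf_res <- tune_grid(workflow(tip ~ ., rf_spec), taxi_folds, grid = 10)
show_best(rf_res, metric = "roc_auc")
```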

Variable importance

  • vip package